Introduction to Machine Learning

AI Bootcamp


Mork Mongkul

Introduction to Machine Learning

What is Machine Learning?

• Mathematically well-defined and solves reasonably narrow tasks.
• Usually constructs predictive models from data, instead of explicitly programming them.
• “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
— Tom Mitchell, Carnegie Mellon University, 1998

• “Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel (1959)

Machine Learning ML Meme

Why Machine Learning?

Machine Learning is transforming industries and daily life. Some key applications include:

• Search engines (e.g. Google)
• Recommender systems (e.g. Netflix)
• Automatic translation (e.g. Google Translate)
• Speech understanding (e.g. Siri, Alexa)
• Game playing (e.g. AlphaGo)
• Self-driving cars
• Personalized medicine
• Progress in all sciences: Genetics, astronomy, chemistry, neurology, physics, …

Examples: spam email filtering, recommendation systems, face recognition, house price prediction, fraud detection

AI, ML, and DL

Many people are confused about what these terms actually mean.
And what does all this have to do with statistics?

Artificial Intelligence

• General term for very large and rapidly developing field.
• No strict definition, but often used when machines perform tasks that previously could only be solved by humans, or tasks that are very difficult and assumed to require “intelligence”.
• Started in the 1940s – when the computer was invented. Turing and von Neumann immediately asked: If we can formalize computation, can we use that to formalize “thinking”?
• Includes ML, NLP, computer vision, robotics, planning, search, intelligent agents, …
• Sometimes misused as a “hype” term for ML or … basic data analysis.
• Or people refer to the fascinating developments in the area of foundation models.

AI

Machine Learning

• Subfield of AI that investigates methods that allow computers to learn.
• Focus on: How can we let machines learn?
• Statistical learning theory: How can we measure and guarantee that a machine learns?

Machine Learning

Deep Learning

• Subfield of ML which studies neural networks.
• Artificial neural networks are roughly inspired by the human brain, but we treat them as useful, mathematical models.
• Studied for decades (beginnings in the 1940s/50s). Modern deep learning uses more layers, specialized neurons (e.g., convolutional ones for images), and many computational improvements to train on large data.
• Can be used on tabular data, but typical applications are images, texts or signals.
• The last 15-20 years have produced remarkable results and imitations of human abilities, where the results looked intelligent.
“Any sufficiently advanced technology is indistinguishable from magic.”
Arthur C. Clarke’s 3rd law

Deep Learning Neural Network 1 Neural Network 2

ML vs Statistics

• Historically developed as different fields, but many methods and concepts are pretty much the same.
• ML: Rather accurate predictions with more complex models.
• Stats: More interpreting relationships and sound inference.
• Now: Both basically work on same problems with same tools.
• Communities are still divided.
• Often different terminology for the same concepts.
• Most parts of ML could also be called: nonparametric statistics plus efficient numerical optimization.
• Personal opinion: Nowadays few practical differences, seeing differences instead of commonalities mainly holds you back.

Stats Meme

Types of Machine Learning

Types of Machine Learning

Supervised Learning: learn a model from labeled data (ground truth)
Given a new input X, predict the right output y
Given examples of stars and galaxies, identify new objects in the sky

Unsupervised Learning: explore the structure of the data (X) to extract meaningful information
Given inputs X, find which ones are special, similar, anomalous, …

Semi-Supervised Learning: learn a model from (few) labeled and (many) unlabeled examples
Unlabeled examples add information about which new examples are likely to occur

Reinforcement Learning: develop an agent that improves its performance based on interactions with the environment

Supervised Machine Learning

• Learn a model from labeled training data, then make predictions
• Supervised: we know the correct/desired outcome (label)
• Subtypes: classification (predict a class) and regression (predict a numeric value)
• Most supervised algorithms that we will see can do both

Supervised Learning Example

Types of Supervised Machine Learning

Supervised learning can be applied to two main types of problems:
Classification: Where the output is a categorical variable (e.g., spam vs. non-spam emails, yes vs. no).
Regression: Where the output is a continuous variable (e.g., predicting house prices, stock prices).

Regression vs Classification

Regression

Regression

Regression is a type of supervised machine learning where algorithms learn from data to predict continuous values such as sales, salary, weight, or temperature. For example, given a dataset containing features of a house (lot size, number of bedrooms, number of baths, neighborhood, etc.) and the price of the house, a regression algorithm can be trained to learn the relationship between the features and the price.

• Predict a continuous value.
• Target variable is numeric.
• Some algorithms can return a confidence interval.
• Find the relationship between predictors and the target.

Regression Example

Regression Algorithms

There are many machine learning algorithms that can be used for regression tasks. Some of them are:
• Linear Regression
• Multiple Regression
• Decision Tree
• Random Forest
• Gradient Boosting Regression

Linear Regression Multiple Regression Decision Tree Random Forest Gradient Boosting

Linear Regression

Linear Regression

Linear Regression is a supervised learning algorithm that is used to model the relationship between a dependent variable and an independent variable. The algorithm finds the best-fit straight-line relationship (linear equation) between the two variables. This statistical method is then used to predict the outcome of future events and is quite useful for predictive analysis.

• Goal: We want to predict a continuous number (e.g., House Price, Temperature, Stock Value) based on input data.
• Input (\(X\)): Features (Square footage, number of rooms, location)
• Output (\(y\)): The Target (Price)

Regression Example

Linear Regression: Example 1

Regression Example

Regression Example

Linear Regression: Example 1

Training Set

Notation:

\(m\) = number of training examples
\(x\) = input variable / feature
\(y\) = output/target variable
\((x, y)\) = one training example
\((x^{(i)}, y^{(i)})\) = the \(i\)th training example

Regression Example

Linear Regression: Example 1

Model Representation & Cost Function

Model Representation:

Regression Example

Cost Function (Hypothesis):

• Hypothesis: \(h(x) = ax+b\), where \(a\) and \(b\) are called parameters

Regression Example

Regression Example

Linear Regression: Example 1

Cost Function (Cont.)

• Hypothesis: \(h(x) = ax+b\)

Regression Example

Linear Regression: Example 1

Cost Function (Cont.)

Goal: Choose \(a\) and \(b\) so that \(h(x) = ax+b\) is close to \(y\) for each training example \((x, y)\).

Regression Example

For each \((x^{(i)}, y^{(i)})\): Minimize \(|h(x^{(i)}) - y^{(i)}|\)

\[\Rightarrow \text{Minimize } \frac{1}{m} \sum_{i=1}^m |h(x^{(i)}) - y^{(i)}|\]

Squared Error Function (using squares instead of absolute values makes \(J\) smooth and differentiable):

\[J(a, b) = \frac{1}{m} \sum_{i=1}^m \left(h(x^{(i)}) - y^{(i)}\right)^2\] \[\Rightarrow \min_{a, b} J(a, b)\]
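The cost \(J(a, b)\) above can be checked numerically with a short Python sketch; the data points are invented purely for illustration:

```python
# Mean squared error cost J(a, b) for the hypothesis h(x) = a*x + b.
def cost(a, b, xs, ys):
    m = len(xs)
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / m

# Toy data lying exactly on y = 2x (illustrative, not from the slides).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(2.0, 0.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(1.0, 0.0, xs, ys))  # a worse line gives a larger cost
```

Evaluating the cost at a few \((a, b)\) pairs like this is a useful sanity check before implementing gradient descent.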

Linear Regression: Example 1

Gradient Descent

Suppose we have the cost function \(J(a, b)\); our goal is to find \(\min_{a, b} J(a, b)\).

Regression Example

Algorithm Outline:

- Start with some values of \(a\) and \(b\)
- Keep changing \(a\) and \(b\) to reduce \(J(a, b)\) until we hopefully end up at a minimum

Things to consider:

- Choose the learning rate \(\alpha\)
- Global minimum vs. local minimum
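The outline above can be sketched from scratch in Python. The toy data, learning rate, and iteration count below are illustrative assumptions, not values from the slides:

```python
# Gradient descent for h(x) = a*x + b with the squared-error cost J(a, b).
# dJ/da = (2/m) * sum((h(x)-y)*x), dJ/db = (2/m) * sum(h(x)-y).
def gradient_descent(xs, ys, alpha=0.05, iters=2000):
    m = len(xs)
    a, b = 0.0, 0.0                          # start with some values of a and b
    for _ in range(iters):
        errs = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = 2.0 / m * sum(e * x for e, x in zip(errs, xs))
        grad_b = 2.0 / m * sum(errs)
        a -= alpha * grad_a                  # step against the gradient
        b -= alpha * grad_b
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                    # exactly y = 2x + 1
a, b = gradient_descent(xs, ys)
print(a, b)                                  # close to 2.0 and 1.0
```

With a learning rate this small and noiseless data, the parameters converge to the true line; a much larger \(\alpha\) would make the updates diverge.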

Linear Regression: Exercise 1

Land Price Prediction

Given a dataset of land prices as illustrated in the table below, find a linear regression model that fits the data. Train the model using the gradient descent algorithm (implemented from scratch).

Regression Example

Linear Regression: Example 2

LR with Multiple Variables

Regression Example

Notation

- \(n\) = number of features
- \(x^{(i)}\) = input features of the \(i\)th example
- \(x_j^{(i)}\) = value of feature \(j\) of the \(i\)th example

Regression Example

Linear Regression: Example 2

LR with Multiple Variables

Notation

- Hypothesis: \(h(x) = a x_1 + b x_2 + c\)
  or \(h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\)
  or \(h(x) = \sum_{j=0}^{n} \theta_j x_j\), where \(x_0 = 1\) and \(n = 2\)
- Cost function: \(J(\theta_0, \theta_1, \ldots, \theta_n) = J(\theta) = \frac{1}{m} \sum_{i=1}^m \left(h(x^{(i)}) - y^{(i)}\right)^2\)

Regression Example
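With \(x_0 = 1\), the multivariable hypothesis and cost can be written as a matrix-vector product. A tiny NumPy check, with theta and the data invented for illustration:

```python
import numpy as np

# Each row of X is one example [x0=1, x1, x2]; illustrative values only.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0]])
y = np.array([13.0, 23.0])
theta = np.array([1.0, 3.0, 2.0])  # theta_0, theta_1, theta_2

h = X @ theta                      # h(x^(i)) for all examples at once
J = np.mean((h - y) ** 2)          # cost J(theta)
print(h)                           # [13. 23.] -> both predictions exact
print(J)                           # 0.0 for this hand-picked theta
```

The vectorized form is how the cost and gradients are usually implemented in practice, since it avoids explicit Python loops over examples.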

Linear Regression: Example 2

LR with Multiple Variables:Feature Scaling

Make sure all features are on a similar scale

Regression Example
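One common way to put all features on a similar scale is standardization (zero mean, unit variance per feature). A minimal sketch, with made-up feature values:

```python
import numpy as np

# Illustrative features: [size in sq ft, number of bedrooms].
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [852.0,  2.0]])

mu = X.mean(axis=0)                # per-feature mean
sigma = X.std(axis=0)              # per-feature standard deviation
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))       # ~0 for each feature
print(X_scaled.std(axis=0))        # 1 for each feature
```

Note that the same `mu` and `sigma` computed on the training set must be reused to scale any new inputs at prediction time.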

Linear Regression

Evaluation Approach

Train/Test Split (Data Generalization)

- The dataset is divided into two parts: training set and test set
- Training set: used to learn patterns and fit the model
- Test set: used to evaluate how well the model generalizes to unseen data
- Prevents overfitting and over-optimistic accuracy
- Analogy: Training = practice questions, Test = real exam

Regression Example
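A train/test split can be done by shuffling indices and cutting at a chosen fraction. The 80/20 ratio, seed, and toy data below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2).astype(float)  # 10 toy examples, 2 features
y = np.arange(10).astype(float)

idx = rng.permutation(len(X))                   # shuffled example indices
cut = int(0.8 * len(X))                         # 80% for training
train_idx, test_idx = idx[:cut], idx[cut:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(X_train), len(X_test))                # 8 2
```

In practice, `sklearn.model_selection.train_test_split` does the same thing; the manual version above just makes the mechanics explicit.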

Linear Regression: Example 2

Hyperparameter

Learning Rate

Learning rate is a floating-point number you set that influences how quickly the model converges. If the learning rate is too low, the model can take a long time to converge. However, if the learning rate is too high, the model never converges; instead it bounces around the weights and bias that minimize the loss. The goal is to pick a learning rate that is neither too high nor too low, so that the model converges within a reasonable number of iterations.

In the figure on the right, the loss curve shows the model significantly improving during the first 20 iterations before beginning to converge.

Regression Example

In contrast, a learning rate that’s too small can take too many iterations to converge. In this second figure, a loss graph for a model trained with a small learning rate, the loss curve shows the model making only minor improvements after each iteration.

Regression Example

A learning rate that’s too large never converges, because each iteration either causes the loss to bounce around or to increase continually. In the third figure, a loss graph for a model trained with a learning rate that’s too big, the loss curve fluctuates wildly, going up and down as the iterations increase.

Regression Example

In the fourth figure, also a model trained with a learning rate that’s too big, the loss decreases at first and then drastically increases in later iterations.

Regression Example
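The behaviors in the figures above can be reproduced on a toy cost function \(f(w) = w^2\) (gradient \(2w\)); the learning rates below are illustrative:

```python
# Run gradient descent on f(w) = w**2 and return how far w is from the
# minimum at 0 after a fixed number of steps.
def run(alpha, steps=30, w=1.0):
    for _ in range(steps):
        w -= alpha * 2 * w        # gradient of w**2 is 2*w
    return abs(w)

print(run(0.4))    # well-chosen rate: converges quickly toward 0
print(run(0.01))   # too small: barely moves after 30 steps
print(run(1.1))    # too large: |w| grows every step, i.e. divergence
```

Each step multiplies \(w\) by \(1 - 2\alpha\), so \(\alpha = 1.1\) gives a factor of \(-1.2\) and the iterates blow up, matching the diverging loss curves described above.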

Linear Regression: Example 2

Hyperparameter

Batch Size

Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias. You might think that the model should calculate the loss for every example in the dataset before updating the weights and bias. However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn’t practical.

Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent:

Stochastic gradient descent (SGD):

Stochastic gradient descent uses only a single example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy. “Noise” refers to variations during training that cause the loss to increase rather than decrease during an iteration. The term “stochastic” indicates that the one example comprising each batch is chosen at random. Notice in the image on the right how loss slightly fluctuates as the model updates its weights and bias using SGD, which can lead to noise in the loss graph:

Regression Example

Tip

Note that using stochastic gradient descent can produce noise throughout the entire loss curve, not just near convergence.

Mini-batch stochastic gradient descent (mini-batch SGD):

Mini-batch stochastic gradient descent is a compromise between full-batch gradient descent and SGD. For \(N\) data points, the batch size can be any number greater than 1 and less than \(N\). The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration. The right number of examples per batch depends on the dataset and the available compute resources. In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent.

Regression Example

Tip

When training a model, you might think that noise is an undesirable characteristic that should be eliminated. However, a certain amount of noise can be a good thing.
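Mini-batch SGD as described above can be sketched as follows; the batch size, learning rate, epoch count, and noiseless toy data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0                      # true relationship: y = 2x + 1

Xb = np.c_[np.ones(len(X)), X]               # prepend x_0 = 1 for the bias
theta = np.zeros(2)                          # [bias, weight]
alpha, batch_size = 0.1, 20

for epoch in range(200):
    order = rng.permutation(len(Xb))         # reshuffle once per epoch
    for start in range(0, len(Xb), batch_size):
        batch = order[start:start + batch_size]
        err = Xb[batch] @ theta - y[batch]
        grad = 2.0 / len(batch) * Xb[batch].T @ err
        theta -= alpha * grad                # one update per mini-batch

print(theta)                                 # close to [1.0, 2.0]
```

Setting `batch_size = 1` turns this loop into SGD, and `batch_size = len(Xb)` into full-batch gradient descent, which is exactly the trade-off described above.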

Linear Regression: Example 2

Hyperparameter

Epochs

During training, an epoch means that the model has processed every example in the training set once. For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it takes the model 10 iterations to complete one epoch. Training typically requires many epochs; that is, the system needs to process every example in the training set multiple times. The number of epochs is a hyperparameter you set before the model begins training. In many cases, you’ll need to experiment with how many epochs it takes for the model to converge. In general, more epochs produce a better model but also take more time to train.

Regression Example
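The iteration arithmetic from the example above, as a quick check (the epoch count is an illustrative choice):

```python
import math

# One weight update per mini-batch, so an epoch takes
# ceil(num_examples / batch_size) iterations.
num_examples, batch_size = 1000, 100     # figures from the text
iters_per_epoch = math.ceil(num_examples / batch_size)
print(iters_per_epoch)                   # 10

epochs = 50                              # illustrative hyperparameter choice
print(epochs * iters_per_epoch)          # 500 weight updates in total
```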

Linear Regression: Example 2

Evaluation Approach

Error-Based Metrics (Predictive Accuracy)

Mean Absolute Error (MAE)

- Measures the average absolute difference between predicted and actual values
- Uses the same unit as the target variable
- Less sensitive to outliers than MSE

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]

Mean Squared Error (MSE)

- Squares prediction errors
- Large errors are penalized heavily
- Sensitive to outliers
- MSE is commonly used as a loss function during training because it is differentiable.

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\]

Root Mean Squared Error (RMSE)

- Represents the typical size of prediction error
- Same unit as the target variable
- Sensitive to outliers
\[\text{RMSE} = \sqrt{\text{MSE}}\]
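All three error metrics follow directly from their definitions; a short NumPy check with invented predictions:

```python
import numpy as np

# Illustrative true values and predictions.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))    # average absolute error
mse = np.mean((y_true - y_pred) ** 2)     # average squared error
rmse = np.sqrt(mse)                       # back in the target's unit

print(mae, mse, rmse)
```

Note how the single large error (1.0) dominates MSE more than MAE, which is exactly the outlier sensitivity listed above.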

Linear Regression: Example 2

Evaluation Approach

Goodness-of-Fit Metrics (Model Explanatory Power)

R-Squared (\(R^2\))

- Measures the proportion of variance in the target variable explained by the model
- Compares the model against a baseline that predicts the mean
- Values typically range from 0 to 1 (higher is better); \(R^2\) can be negative if the model fits worse than simply predicting the mean
- Does NOT measure prediction error

\[ R^2 = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 } \]

Adjusted R-Squared

- Adjusted version of \(R^2\) that accounts for the number of predictors
- Penalizes adding irrelevant features
- Essential for Multiple Linear Regression
- Increases only when a new feature improves the model

\[ \text{Adjusted } R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right) \]

Where: \(n\) = number of samples, \(p\) = number of predictors
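Both formulas can be checked numerically; the targets, predictions, and predictor count below are invented for illustration:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])       # illustrative true values
y_hat = np.array([2.8, 5.1, 7.2, 8.7, 11.2])   # illustrative predictions
n, p = len(y), 2                               # samples and predictors

ss_res = np.sum((y - y_hat) ** 2)              # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)                              # adjusted R^2 <= R^2
```

As the formula implies, adjusted \(R^2\) is always at most \(R^2\), and the penalty grows as \(p\) approaches \(n\).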

Linear Regression: Exercise 2

Linear Regression: Exercise 2

Multiple Linear Regression

Given a dataset of land price:

1. Build a linear regression model which predicts the land price using both the land_area and the distance_to_city features. (See the dataset in ‘land_price_1.csv’.)
2. Using only the distance feature, build a model with hypothesis \(h(x) = \theta_{0}+\theta_{1}x+\theta_{3}\sqrt{x}\) to predict the land price. (See the dataset in ‘land_price_2.csv’.)

Classification

Classification

- Predicts a class label (category), which is discrete and unordered
- Can be binary (e.g., spam / not spam) or multi-class (e.g., letter recognition)
- Many classifiers can return a confidence score per class
- Model predictions create a decision boundary separating the classes


Classification Algorithms

There are many machine learning algorithms that can be used for classification tasks. Some of them are:
• Logistic Regression
• Decision Tree Classifier
• Random Forest Classifier
• Support Vector Machine (SVM)
• K-Nearest Neighbors (KNN)
• Naive Bayes Classifier

Examples: email spam detection, handwritten digit recognition, medical diagnosis, sentiment analysis, credit card fraud detection, wildlife species classification

Logistic Regression

Logistic Regression


Logistic Regression

Hypothesis Representation


Logistic Regression

Cost Function


Logistic Regression

Cost Function (Cont.)

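For reference, logistic regression passes the linear combination \(\theta^T x\) through the sigmoid to get \(P(y = 1 \mid x)\), and training minimizes the cross-entropy (log-loss) cost. A minimal sketch with illustrative theta and data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Cross-entropy (log-loss) cost over all examples.
def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, -2.0],     # each row: [x0=1, x1]; illustrative values
              [1.0,  3.0]])
y = np.array([0.0, 1.0])
theta = np.array([0.0, 1.0])

print(sigmoid(X @ theta))      # predicted probabilities P(y=1 | x)
print(cost(theta, X, y))       # small, since both examples fit well
```

Predicting the class is then just thresholding the probability, e.g. `sigmoid(X @ theta) >= 0.5`.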

Logistic Regression

Exercise 1


Logistic Regression

Multiclass Classification


Exercise 2


Questions?

Instinct Institute

Mork Mongkul